In our architecture there is a CronJob that generates reports, such as Payment and Member info. One day it suddenly broke and the reports were not generated; once we noticed, we started to investigate.
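As a first pass, the state of the CronJob and its Jobs can be checked with something like the commands below (the CronJob name here is only inferred from the Job name we describe next, so treat it as an assumption):
$ kubectl get cronjob generate-payment-history-report-cronjob
$ kubectl get jobs --sort-by=.metadata.creationTimestamp | grep generate-payment-history-report
From there, describing the most recently failed Job shows the Events that pointed us in the right direction.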
$ k describe job generate-payment-history-report-cronjob-manual-20221026
Name: generate-payment-history-report-cronjob-manual-20221026
.
.
Events:
  Type     Reason        Age                    From            Message
  ----     ------        ----                   ----            -------
  Warning  FailedCreate  2m46s (x29 over 145m)  job-controller  Error creating: Internal error occurred: failed calling webhook "mpod.appmesh.k8s.aws": failed to call webhook: Post "https://appmesh-controller-webhook-service.appmesh-system.svc:443/mutate-v1-pod?timeout=30s": no endpoints available for service "appmesh-controller-webhook-service"
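The Events tell us the job-controller cannot create Pods because the mutating admission webhook mpod.appmesh.k8s.aws is unreachable. As a rough sketch (the webhook configuration name below is a placeholder, not taken from our cluster), you can list which webhook configurations intercept Pod creation and check whether their failurePolicy is Fail, which is what turns an unreachable webhook service into a hard failure on Pod creation:
$ kubectl get mutatingwebhookconfigurations | grep appmesh
$ kubectl get mutatingwebhookconfiguration <appmesh-webhook-config-name> \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'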
It looked like the appmesh-controller had a problem, so we dug further into the appmesh-controller related Endpoints and found that they really were gone:
$ k get endpoints -n appmesh-system
NAME                                 ENDPOINTS   AGE
appmesh-controller-webhook-service   <none>      657d
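An empty Endpoints object usually means no Pod behind the Service is Running and Ready, so the next step is to look at the controller workload itself. A minimal sketch, assuming the controller is deployed in appmesh-system:
$ kubectl get deployments -n appmesh-system
$ kubectl get pods -n appmesh-system -o wide
$ kubectl describe service appmesh-controller-webhook-service -n appmesh-system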
Since this is an appmesh controller issue, the behavior is internal to the Cluster. Fortunately our cluster has the Control Plane log groups enabled, so we could trace the Control Plane logs, specifically the Kubernetes API server component logs and the Controller manager logs.
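As a sketch of how to confirm which control plane log types are enabled and pull the relevant API server entries out of CloudWatch, assuming the default EKS log group /aws/eks/<cluster-name>/cluster (replace <cluster-name> with your own cluster):
$ aws eks describe-cluster --name <cluster-name> --query 'cluster.logging'
$ aws logs filter-log-events \
    --log-group-name /aws/eks/<cluster-name>/cluster \
    --log-stream-name-prefix kube-apiserver \
    --filter-pattern etcd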
In the API Server logs we saw the following error repeating:
===== API Error Logs =====
{"@timestamp":"xxx","@message":"I1025 xxx 9 healthz.go:261] etcd check failed: healthz\n[-]etcd failed: error getting data from etcd: context deadline exceeded"}
{"@timestamp":"xxx","@message":"I1025 xxx 9 healthz.go:261] etcd check failed: healthz\n[-]etcd failed: error getting data from etcd: context deadline exceeded\n{\"level\":\"warn\",\"ts\":\"2022-10-25T08:38:27.193Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000e2aa80/#initially=[
We could also see that the health checks of the API server and kube-scheduler dropped briefly right when the incident happened.
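If you want to spot-check this yourself while an incident is happening, recent Kubernetes versions expose health endpoints on the API server that can be queried directly, assuming your credentials are allowed to hit the raw paths; a minimal sketch:
$ kubectl get --raw '/readyz?verbose'
$ kubectl get --raw '/livez/etcd'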
To summarize so far, we can roughly piece together a few threads: